Dataset Cleaning¶
Starting Anew¶
By default, a Dataset will attempt to reinitialise at launch. In short, this means that it looks for a database file that matches itself. If it finds such a file, it will recreate itself from it.
For the user, this makes your workflow robust against restarts and data loss. However, it also means that datasets act like “accumulators”, constantly gaining attributes and runners as you test.
There can come a point where you realise that something you set earlier could be causing problems (or simply isn’t needed). This tutorial runs through some methods of dealing with these situations.
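The reinitialisation behaviour described above can be pictured with a small toy model. This is only a sketch, not remotemanager's implementation: `ToyDataset`, its JSON storage and its `_save` helper are hypothetical names invented for illustration, and only the `skip` keyword mirrors an option named in this tutorial.

```python
import json
import tempfile
from pathlib import Path


class ToyDataset:
    """Hypothetical stand-in for a Dataset (not remotemanager's code)."""

    def __init__(self, dbfile, skip=True):
        self.dbfile = Path(dbfile)
        self.runs = []
        if skip and self.dbfile.exists():
            # A matching file was found: recreate our state from it
            self.runs = json.loads(self.dbfile.read_text())
        else:
            # Start anew, overwriting any previous state on disk
            self._save()

    def append_run(self, args):
        self.runs.append(args)
        self._save()

    def _save(self):
        self.dbfile.write_text(json.dumps(self.runs))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy-dataset.json"

    first = ToyDataset(path)
    first.append_run({"inp": 1})

    # Simulated restart: the state accumulated earlier comes back
    restarted = ToyDataset(path)
    restored_runs = restarted.runs

    # skip=False discards the stored state and starts from scratch
    fresh = ToyDataset(path, skip=False)
    fresh_runs = fresh.runs
```

In this toy, `restored_runs` still holds the run appended before the "restart", which is the accumulator effect, while `fresh_runs` is empty.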
Let's start with the most basic case, skip:
[1]:
from remotemanager import Dataset
def f(inp):
    return inp
ds = Dataset(f,
             skip=False  # new option!
             )
When set to False, the skip argument will force the Dataset to start anew, and thus any variables that were stored are lost. This can also be done by deleting the database file (in fact, this is what is done internally, though a new one is then created). The filename is a combination of name-dataset-uuid.yaml. However, if no name is set, it is omitted.
Deleting this file has the same effect as skip=False, though only once. The file is created again along with the dataset.
The database filename can be seen using Dataset.dbfile:
[2]:
ds.dbfile
[2]:
'dataset-9ebf1589.yaml'
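The naming rule described above can be written out as a short sketch. The helper `db_filename` is hypothetical and exists only to illustrate the pattern; it is not part of remotemanager, and the name `my_name` is an invented example.

```python
def db_filename(uuid, name=None):
    """Join the parts as name-dataset-uuid.yaml, omitting name if unset."""
    parts = [name, "dataset", uuid] if name else ["dataset", uuid]
    return "-".join(parts) + ".yaml"


unnamed = db_filename("9ebf1589")
named = db_filename("9ebf1589", name="my_name")

print(unnamed)  # dataset-9ebf1589.yaml
print(named)    # my_name-dataset-9ebf1589.yaml
```

The unnamed case reproduces the filename shown in the output above.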
You can also force this value to be whatever you want, but only at the dataset initialisation:
[3]:
ds_unique = Dataset(f,
                    dbfile='set_a_name_here',  # new option!
                    )
print(f'the database file for this dataset is now {ds_unique.dbfile}')
the database file for this dataset is now set_a_name_here.yaml
Now whenever you initialise a dataset with this filename, it will attempt to connect with that file.
Finer options¶
This is all well and good, but what if you don’t want to blow up your dataset and start again? For example, you know that one of your runners is causing issues and needs to be removed. There are options for this too.
Let's append some runs and experiment with removing them.
Let's also create a function to show us some information about our runners.
[4]:
for run in range(7):
    ds.append_run(args={'inp': run})
def print_runs():
    for r_id, runner in ds.runner_dict.items():
        print(f'{r_id}: {runner.short_uuid} | {runner.args}')
appended run runner-0
appended run runner-1
appended run runner-2
appended run runner-3
appended run runner-4
appended run runner-5
appended run runner-6
[5]:
print_runs()
runner-0: 9e62f0bc | {'inp': 0}
runner-1: d3c4ab40 | {'inp': 1}
runner-2: 1628650a | {'inp': 2}
runner-3: 1fc9add9 | {'inp': 3}
runner-4: 6d5e0646 | {'inp': 4}
runner-5: 552e16a0 | {'inp': 5}
runner-6: 47fc1095 | {'inp': 6}
Now we can look at all the ways of removing a run. We do this with ds.remove_run(id). Here, id is a “smart” value that can be an int, str or dict. The function behaves slightly differently depending on the input type:
An int will be treated like a list index, and the runner at that index will be removed.
A dict will be treated like arguments, and the runner with those args will be searched for.
A str is first checked against the runner names, and is then checked against the uuid of each runner. Short and long uuids can be used (8 or 64 chars).
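The type-based lookup listed above can be sketched on a plain list of dictionaries. This is an illustrative toy under stated assumptions, not the library's code: `find_runner_index`, the toy `remove_run` and the placeholder uuids are all hypothetical, and treating an out-of-range index as "no match" is a choice made for this sketch.

```python
import hashlib


def find_runner_index(runners, ident):
    """Return the list index of the runner matching ident, or None."""
    if isinstance(ident, int):
        # int: treated as a list index
        return ident if 0 <= ident < len(runners) else None
    if isinstance(ident, dict):
        # dict: treated as run arguments
        for i, runner in enumerate(runners):
            if runner["args"] == ident:
                return i
        return None
    if isinstance(ident, str):
        # str: runner names are checked first...
        for i, runner in enumerate(runners):
            if runner["name"] == ident:
                return i
        # ...then full (64 char) and short (8 char) uuids
        for i, runner in enumerate(runners):
            if ident in (runner["uuid"], runner["uuid"][:8]):
                return i
        return None
    raise TypeError(f"unsupported identifier type: {type(ident).__name__}")


def remove_run(runners, ident):
    """Remove the matching runner; return True if one was removed."""
    index = find_runner_index(runners, ident)
    if index is None:
        return False
    del runners[index]
    return True


# Toy runners with placeholder 64-character uuids
runners = [
    {
        "name": f"runner-{n}",
        "uuid": hashlib.sha256(str(n).encode()).hexdigest(),
        "args": {"inp": n},
    }
    for n in range(4)
]

results = [
    remove_run(runners, 0),                       # by index
    remove_run(runners, {"inp": 2}),              # by args
    remove_run(runners, "runner-3"),              # by name
    remove_run(runners, runners[0]["uuid"][:8]),  # by short uuid
    remove_run(runners, 5),                       # nothing at this index
]
print(results)  # [True, True, True, True, False]
```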
Runner Removal¶
Firstly, if you know the index of the runner within ds.runners, you can pass that index:
[6]:
ds.remove_run(0)
print_runs()
removed runner dataset-9ebf1589-runner-0
runner-1: d3c4ab40 | {'inp': 1}
runner-2: 1628650a | {'inp': 2}
runner-3: 1fc9add9 | {'inp': 3}
runner-4: 6d5e0646 | {'inp': 4}
runner-5: 552e16a0 | {'inp': 5}
runner-6: 47fc1095 | {'inp': 6}
Runner 0 has disappeared!
Note
This function always removes the runner at that index. So in this case if we call again with index 0, runner-1 would be removed, as it is the first.
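The index shift described in this note is the same one seen with an ordinary Python list, shown here as a minimal sketch (the list of names is a stand-in, not the real ds.runners):

```python
runs = [f"runner-{n}" for n in range(4)]

del runs[0]  # removes runner-0
del runs[0]  # index 0 now points at runner-1, so that is removed

print(runs)  # ['runner-2', 'runner-3']
```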
Next is the uuid. This can be found by printing the uuid of a runner you have access to:
[7]:
r_uuid = ds.runners[2].uuid
print(r_uuid)
r_short_uuid = ds.runners[3].short_uuid
print(r_short_uuid)
1fc9add953e10b337317695f16173d23ce790b3d03d78192fafb55b9e6bc51e7
6d5e0646
[8]:
ds.remove_run(r_uuid)
ds.remove_run(r_short_uuid)
print_runs()
removed runner dataset-9ebf1589-runner-3
removed runner dataset-9ebf1589-runner-4
runner-1: d3c4ab40 | {'inp': 1}
runner-2: 1628650a | {'inp': 2}
runner-5: 552e16a0 | {'inp': 5}
runner-6: 47fc1095 | {'inp': 6}
We grabbed the uuids of the runners at index 2 and 3, which in the runner list are runners 3 and 4 (as we removed runner 0). These two have also disappeared.
If you don’t know the index of the runner and don’t have its uuid stored, you can remove by args. This compares the passed args with those stored by each runner and removes the runner that matches. This is arguably the most flexible (and useful) method, though it is less efficient than the other approaches.
For example, if you appended a run with {'inp': 6}, you can remove that runner by calling:
[9]:
ds.remove_run({'inp': 6})
print_runs()
removed runner dataset-9ebf1589-runner-6
runner-1: d3c4ab40 | {'inp': 1}
runner-2: 1628650a | {'inp': 2}
runner-5: 552e16a0 | {'inp': 5}
Looks like runner 6 (which had inp: 6) has also gone.
Finally, removing by index may be confusing if runs have already been removed (i.e. you don’t have a continuous, zero-indexed list). Thus, you can remove by the actual runner name by passing remove_run("runner-{n}"):
[10]:
ds.remove_run('runner-5')
print_runs()
removed runner dataset-9ebf1589-runner-5
runner-1: d3c4ab40 | {'inp': 1}
runner-2: 1628650a | {'inp': 2}
Leaving us with just runners 1 and 2.
This function also returns True or False depending on whether it removed a runner:
[11]:
print('removed runner-2?:', ds.remove_run(1))
print('removed runner-3?:', ds.remove_run(2))
print('\nfinal runner list:')
print_runs()
removed runner dataset-9ebf1589-runner-2
removed runner-2?: True
removed runner-3?: False
final runner list:
runner-1: d3c4ab40 | {'inp': 1}
Clearing Runners¶
There is one additional option for removing runners: wipe_runs. This removes all runs from the dataset:
Note
We’re using confirm=False here to allow the notebook to be tested, but be aware that this will skip the confirmation dialog.
[12]:
ds.wipe_runs(confirm=False)
print(ds.runners)
[]
Persistence¶
All these changes are, of course, saved to the database when performed, so be careful when using them. If we simulate restarting this notebook (or a different notebook that also uses this dataset), we will see no runners:
[13]:
ds = Dataset(f)
print(ds.runners)
ds.append_run({'inp': 2})
print(ds.runners)
[]
appended run runner-0
[dataset-9ebf1589-runner-0]
Re-adding Runners¶
Adding runners back to a dataset that has “holes” within its runner storage will cause no harm. Runners will be added to fill any missing spaces then continue as normal after that:
[14]:
for run in range(10):
    ds.append_run({'inp': run})
appended run runner-1
appended run runner-2
runner runner-0 already exists
appended run runner-3
appended run runner-4
appended run runner-5
appended run runner-6
appended run runner-7
appended run runner-8
appended run runner-9
[15]:
print_runs()
runner-0: 1628650a | {'inp': 2}
runner-1: 9e62f0bc | {'inp': 0}
runner-2: d3c4ab40 | {'inp': 1}
runner-3: 1fc9add9 | {'inp': 3}
runner-4: 6d5e0646 | {'inp': 4}
runner-5: 552e16a0 | {'inp': 5}
runner-6: 47fc1095 | {'inp': 6}
runner-7: df3419b5 | {'inp': 7}
runner-8: 8b0ff3e5 | {'inp': 8}
runner-9: ca823715 | {'inp': 9}
Note how we now have 10 runners, as expected. The runner with inp: 2 has been skipped, as it already exists.
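In the outputs above, a runner with the same args receives the same short uuid each time it is created (for example 9e62f0bc for inp: 0), which suggests the uuid is derived deterministically from its inputs. How remotemanager actually computes it is not stated here, so the sketch below is an assumption: it hashes the JSON-serialised args with SHA-256 (which happens to give 64 hex characters) and uses that hash to skip duplicates. `args_uuid` and `ToyRunnerStore` are hypothetical names.

```python
import hashlib
import json


def args_uuid(args):
    """Assumed scheme: a SHA-256 hex digest of the sorted, serialised args."""
    payload = json.dumps(args, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


class ToyRunnerStore:
    """Hypothetical container that refuses duplicate runs."""

    def __init__(self):
        self.runners = {}  # uuid -> args

    def append_run(self, args):
        uuid = args_uuid(args)
        if uuid in self.runners:
            # A runner with identical args already exists: skip it
            return False
        self.runners[uuid] = args
        return True


store = ToyRunnerStore()
store.append_run({"inp": 2})

# Appending inp 0..9 skips only the duplicate inp: 2
added = [store.append_run({"inp": n}) for n in range(10)]
```

After the loop, `added[2]` is False and the store holds 10 runners, mirroring the behaviour shown above.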
Run Args¶
Now that you know how to remove runners, what about run args? If you want to update a value, in most cases you can simply overwrite it. If a run argument is causing issues, you can also delete it with the usual Python syntax.
Note
Starting in version 0.10.0, run_args are no longer accessible at the dataset level (i.e. ds.mpi), and must be accessed via the run_args property.
[16]:
ds.set_run_arg("mpi", 16)
print(ds.run_args["mpi"])
16
[17]:
print(ds.global_run_args)
del ds.run_args["mpi"]
print(ds.global_run_args)
{'skip': True, 'force': False, 'asynchronous': True, 'local_dir': 'temp_runner_local', 'remote_dir': 'temp_runner_remote', 'mpi': 16}
{'skip': True, 'force': False, 'asynchronous': True, 'local_dir': 'temp_runner_local', 'remote_dir': 'temp_runner_remote'}
Cleaning Directories¶
Added in version 0.5.9.
Too much clutter from testing? Dataset has some functions which help with deleting unwanted data:
dataset.wipe_local() will attempt to delete any local directories
dataset.wipe_remote() will attempt to delete any remote and run directories
If you really want to reset, Dataset also provides a dataset.hard_reset() function, which will do all of the above, delete the database file and then clear any runners. This essentially gives you a like-new dataset.